Context: I started playing the piano in 2018 as a complete beginner, and I’ve been tracking my practice time for around two and a half years. I’ve now decided to put that data to good use and see what interesting patterns I can find.
Disclaimer: I am not affiliated with Toggl. I started using it a few years ago because it provided all the functionality I needed, and I loved its minimalistic design. The standard membership, which I use, is free of charge.
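Throughout the post, `raw_data` holds one row per Toggl time entry. The actual ETL from the Toggl export is not shown, so the tibble below is a hypothetical stand-in; it only sketches how the month columns the charts rely on (`Month_Year`, `Month_format`) can be derived:

```r
library(dplyr)
library(zoo)      # as.yearmon()
library(stringr)

# Hypothetical stand-in for the Toggl export: one row per session, Duration in minutes
raw_data <- tibble::tibble(
  Date_Start = as.Date(c("2018-10-05", "2018-10-20", "2019-01-03")),
  Duration   = c(45, 60, 30)
)

raw_data <- raw_data %>%
  mutate(
    Month_Year   = as.factor(as.yearmon(Date_Start)),      # e.g. "Oct 2018"
    Month_format = str_replace(Month_Year, " 20", "\n '")  # e.g. "Oct\n '18"
  )
```

The `"\n '"` replacement is what produces the two-line axis labels (month over abbreviated year) seen in the plots.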
knitr::opts_chunk$set(
echo = TRUE, # show all code
tidy = FALSE, # cleaner code printing
size = "small", # smaller code
fig.path = "figures/", #graphics location
out.width = "100%",
message = FALSE,
warning = FALSE
)

raw_data%>%
group_by(Month_format)%>%
summarise(Total_Duration = sum(Duration)/60)%>%
mutate(Total_Duration2 = as.integer(cumsum(Total_Duration)),
max = as.integer(max(Total_Duration2)),
max = ifelse(max > Total_Duration2, "", max))%>%
ggplot(aes(Month_format, Total_Duration2, group = 1))+
geom_line(size = 2, color = "#69b3a2")+
geom_point(size = 5, color = "#69b3a2")+
geom_area(alpha = 0.3, fill = "#69b3a2")+
# grade 3
geom_point(x="Oct\n '18", y = 253, size = 5, color = "darkred")+
geom_text(x="Oct\n '18", y = 253+200, size = 5, label = "Grade 3")+
geom_text(x="Oct\n '18", y = 253+100, size = 5, label = "253 hours")+
# grade 5
geom_point(x="Oct\n '19", y = 675, size = 5, color = "darkred")+
geom_text(x="Oct\n '19", y = 675+200, size = 5, label = "Grade 5")+
geom_text(x="Oct\n '19", y = 675+100, size = 5, label = "675 hours")+
# grade 6
geom_point(x="Oct\n '20", y = 1078, size = 5, color = "darkred")+
geom_text(x="Oct\n '20", y = 1078+200, size = 5, label = "Grade 6")+
geom_text(x="Oct\n '20", y = 1078+100, size = 5, label = "1078 hours")+
# NOW
geom_point(aes(x="Apr\n '21", y = 1219), size = 5, color = "darkred")+
geom_text(aes(label = max), nudge_y = 75, nudge_x = -0.5, size = 5)+
scale_fill_gradient(low="yellow", high="red")+
labs(x = NULL,
y = "Total hours of practice",
title = "Piano practice timeline")+
theme_ipsum_es()+
theme(legend.position = "top")

Based on my level at the time and the difficulty of each piece, we can see that learning a piece took around 10-30 hours of practice.
raw_data%>%
filter(Date_Start > as.Date("2018/11/01"))%>%
filter(Completed == "Yes")%>%
group_by(Project, Date_Start)%>%
summarise(Duration = sum(Duration)/60)%>%
mutate(Cumulative_Piece = cumsum(Duration),
Month_Year = as.factor(as.yearmon(Date_Start)),
Month_format = str_replace(Month_Year, " 20", "\n '"))%>%
ungroup()%>%
mutate(Cumulative_Total = cumsum(Duration))%>%
filter(Project %notin% c("Technique", "General", "Sightreading"))%>%
left_join(model_data%>%select(Level, Project, ABRSM), by = "Project")%>%
ggplot(aes(Date_Start, Cumulative_Piece, fill = Level)) +
geom_point(size = 10, shape = 21, col = "black", alpha = 0.5) +
scale_size(range = c(.1, 16), guide = FALSE) +
labs(title = 'Year: {frame_time}',
y = "Total practice time per piece (hours)")+
scale_color_tron()+
scale_fill_tron()+
theme_ipsum_es() +
theme(legend.position = "top")+
transition_time(Date_Start) +
ease_aes('linear')+
exit_fade() +
shadow_mark(alpha = 0.1, size = 5)

# save animation as gif for later use
# anim_save("figs/timeline.gif")

Generally, I’ve done pretty well at maintaining a high level of consistency, with the exception of August and December, which is usually when I take my annual leave.
raw_data%>%
filter(Source != "Estimated")%>%
group_by(Month_Year, Month_Start, Month_format)%>%
summarise(Days_Practice = n_distinct(Date_Start),
Total_Duration = sum(Duration, na.rm = TRUE))%>%
mutate(Days_Total = days_in_month(Month_Start),
Days_Not_Practiced = Days_Total - Days_Practice,
Avg_Duration = as.integer(Total_Duration/Days_Total),
Consistency = round(Days_Practice / Days_Total * 100,2),
Consistency_Status = ifelse(Consistency<75, "Bad", "Good"),
Month_format = reorder(Month_format, Month_Year))%>%
ggplot(aes(Month_format, Consistency, fill = Consistency_Status))+
geom_col(group = 1, col = "black")+
geom_hline(yintercept = 75, lty = "dashed")+
geom_text(aes(label = Days_Not_Practiced), size = 5, nudge_y = 3)+
labs(x = NULL,
fill = "Consistency status",
subtitle = "Numbers indicate days without any practice within each month")+
scale_color_tron()+
scale_fill_tron()+
theme_ipsum_es()+
theme(legend.position = "top")

We can see that my practice time correlated with consistency: the average session was much shorter in the months I was away from the piano. There’s also a clear pattern where my practice close to an exam session was significantly higher than at any other time of the year. Can you spot in which month I had my exam in 2019? What about the end of 2020?
Note that the average practice length per month includes the days on which I did not practice.
raw_data%>%
filter(Source != "Estimated")%>%
group_by(Month_Year, Month_Start, Month_format)%>%
summarise(Days_Practice = n_distinct(Date_Start),
Total_Duration = sum(Duration))%>%
mutate(Days_Total = days_in_month(Month_Start),
Avg_Duration = as.integer(Total_Duration/Days_Total),
Avg_Duration_Status = ifelse(Avg_Duration < 60, "Less than one hour", "One hour"),
Month_format = reorder(Month_format, Month_Year))%>%
ggplot(aes(Month_format, Avg_Duration, fill = Avg_Duration_Status))+
geom_col(col = "black")+
labs(x = NULL,
y = "Average practice session length (min)",
fill = "Status")+
geom_text(aes(label = Avg_Duration), nudge_y = 5, size = 5)+
scale_color_tron()+
scale_fill_tron()+
theme_ipsum_es()+
theme(legend.position = "top",
axis.text.y = element_blank(),
axis.ticks.y = element_blank())

Similar trends are apparent: my average daily session is longer before exams than at any other time of the year, and there’s a dip in the months when I usually take most of my annual leave. I really do need to start picking up the pace and get back to where I used to be.
raw_data%>%
group_by(Month_Year, Month_Start, Month_format, Month_Name, Year)%>%
summarise(Days_Practice = n_distinct(Date_Start),
Total_Duration = sum(Duration))%>%
mutate(Days_Total = days_in_month(Month_Start),
Avg_Duration = as.integer(Total_Duration/Days_Total),
Avg_Duration_Status = ifelse(Avg_Duration < 60, "Less than one hour", "One hour"),
Month_format = reorder(Month_format, Month_Year),
size = as.factor(ifelse(Year == 2018, 1, 1.5)),
label = ifelse(month(Month_Start) == 1, as.character(Year), ""))%>%
ggplot(aes(Month_Name, Avg_Duration, group = Year, size = size))+
geom_line(aes(col = Year))+
geom_label_repel(aes(label = label, col = Year))+
labs(x = NULL,
fill = "Status")+
scale_color_tron()+
scale_fill_tron()+
theme_ipsum_es()+
theme(legend.position = "none")

Despite a similar median, we can see that practice sessions were less likely to be over 80 min after COVID. We can test whether this impact was significant with a t-test.
covid_start <- as.Date("2020/03/23")
inference <- raw_data%>%
filter(Source != "Estimated")%>%
mutate(Covid_Status = as.factor(ifelse(Date_Start < covid_start, "Before COVID", "After COVID")),
Covid_Status = reorder(Covid_Status, desc(Covid_Status)))%>%
group_by(Covid_Status, Date_Start)%>%
summarise(Duration = sum(Duration))%>%
ungroup()
ggplot(inference, aes(Covid_Status, Duration, fill = Covid_Status))+
geom_boxplot(varwidth = TRUE, col = "black")+
labs(x = NULL,
y = "Average practice session (min)")+
scale_color_tron()+
scale_fill_tron()+
theme_ipsum_es()+
theme(legend.position = "none")

Given the extremely low p-values, the Shapiro-Wilk test implies that the distribution of the data differs significantly from a normal distribution, so we cannot assume normality. However, we’re working with the entire population of sessions for each group, so, unlike independence, this assumption is not crucial.
inference %>%
select(Covid_Status, Duration) %>%
group_by(group = as.character(Covid_Status)) %>%
do(tidy(shapiro.test(.$Duration)))%>%
kbl(caption = "Shapiro-Wilk normality test")%>%
kable_paper("hover", full_width = F)

| group | statistic | p.value | method |
|---|---|---|---|
| After COVID | 0.9607325 | 3e-07 | Shapiro-Wilk normality test |
| Before COVID | 0.9549818 | 0e+00 | Shapiro-Wilk normality test |
With the large p-value from Levene’s test, we fail to reject the null hypothesis (H0) and conclude that we have no evidence that the population variances differ, so we can use the equal-variances assumption for our t-test.
tidy(leveneTest(inference$Duration~inference$Covid_Status))%>%
kbl(caption = "Levene's test")%>%
kable_paper("hover", full_width = F)

| statistic | p.value | df | df.residual |
|---|---|---|---|
| 0.0410026 | 0.8395891 | 1 | 732 |
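For reference, the pooled (equal-variance) two-sample t-statistic computed below is:

$$
t = \frac{\bar{x}_1 - \bar{x}_2}{s_p\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}},
\qquad
s_p^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2},
$$

with $n_1 + n_2 - 2 = 433 + 301 - 2 = 732$ degrees of freedom, matching the t-test table.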
My practice sessions post-COVID are significantly shorter than those before the pandemic. This might be surprising, given that we were in lockdown most of the time. However, I’ve been spending that time on a few other things, such as improving my technical skillset in R (this analysis wouldn’t have been possible otherwise) and learning Italian.
t_test <- inference%>%
t_test(Duration ~ Covid_Status, var.equal = TRUE)%>%
add_significance()%>%
kbl()%>%
kable_paper("hover", full_width = F)
t_test

| .y. | group1 | group2 | n1 | n2 | statistic | df | p | p.signif |
|---|---|---|---|---|---|---|---|---|
| Duration | Before COVID | After COVID | 433 | 301 | 3.319481 | 732 | 0.000947 | *** |
graph_practice <- function(variable, nudge){
raw_data%>%
filter(Genre %notin% c("Other", "Not applicable"))%>%
group_by({{variable}})%>%
summarise(Duration = as.integer(sum(Duration)/60))%>%
arrange(desc(Duration))%>%
head(10)%>%
ggplot(aes(reorder({{variable}}, Duration), Duration, fill = Duration))+
geom_col(show.legend = FALSE, col = "black", width = 1)+
geom_text(aes(label = Duration), show.legend = FALSE, nudge_y = nudge, size = 5)+
scale_fill_gradient(low="yellow", high="red")+
labs(x = NULL,
y = "Total hours of practice")+
coord_flip()+
theme_ipsum_es()+
theme(axis.text.x = element_blank(),
axis.ticks = element_blank())
}
graph_practice(Genre, 15)

graph_practice(Composer, 5)

graph_practice(Project, 3)

Simplified, ABRSM grades are a series of eight exams ordered by difficulty (1 = beginner to 8 = advanced). There are also diploma grades, but those are extremely advanced (equivalent to university-level study) and out of the scope of this analysis.
More information can be found on their official website at https://gb.abrsm.org/en/exam-support/your-guide-to-abrsm-exams/
model_data%>%
mutate(Duration = Duration)%>%
ggplot(aes(ABRSM, Duration, fill = ABRSM))+
geom_boxplot(varwidth = TRUE, outlier.colour = "red")+
labs(x = "ABRSM Grade",
y = "Total practice hours",
subtitle = "The higher the difficulty, the more time it takes to learn a piece")+
scale_color_tron()+
scale_fill_tron()+
theme_ipsum_es()+
theme(legend.position = "none")

A further aggregation of ABRSM grades; this is helpful given the very limited data within each grade, and it’s much easier on the eye. This is an oversimplification, but the grades are classified as:

* 1-4: Beginner (1)
* 5-6: Intermediate (2)
* 7-8: Advanced (3)
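The grouping can be sketched as a simple recode. This is a hypothetical helper (the actual ETL that builds `Level` isn’t shown in the post), consistent with the repertoire table, where grades 1-4 appear as Beginner and 5-6 as Intermediate:

```r
library(dplyr)

# hypothetical recode of an ABRSM grade (1-8) into the coarser Level used below
grade_to_level <- function(abrsm) {
  case_when(
    abrsm <= 4 ~ "Beginner",
    abrsm <= 6 ~ "Intermediate",
    TRUE       ~ "Advanced"
  )
}

grade_to_level(c(1, 4, 5, 6, 7, 8))
# "Beginner" "Beginner" "Intermediate" "Intermediate" "Advanced" "Advanced"
```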
model_data%>%
mutate(Duration = Duration)%>%
ggplot(aes(Level, Duration, fill = Level))+
geom_boxplot(varwidth = TRUE, outlier.colour = "red")+
scale_color_tron()+
scale_fill_tron()+
labs(x = "Level",
y = "Total practice hours",
subtitle = "The higher the difficulty, the more time it takes to learn a piece")+
theme_ipsum_es()+
theme(legend.position = "none")

model_data%>%
ggplot(aes(Length, Duration, group = 1))+
geom_jitter(aes(col = Level), width = 0.5, height = 0.5, size = 3)+
geom_smooth(method = "lm", se=FALSE)+
labs(x = "Piece length (mins)",
y = "Hours needed to learn a piece",
subtitle = "There appears to be a linear trend between piece length and total practice time")+
scale_color_tron()+
scale_fill_tron()+
theme_ipsum_es()+
theme(legend.position = "top")

We can spot a trend where the time required to learn a piece of similar difficulty (ABRSM grade) decreases as my ability to play the piano increases (as judged by cumulative hours of practice). We should keep this in mind and include it as a variable in our prediction model.
model_data%>%
ggplot(aes(Cumulative_Duration, Duration, group = 1))+
geom_point(aes(col = Level), size = 3)+
geom_smooth(method = "lm", se=FALSE)+
labs(x = "Cumulative hours practiced before the first practice of each piece",
y = "Hours needed to learn a piece",
subtitle = "Pieces of a similar difficulty become faster to learn")+
scale_color_tron()+
scale_fill_tron()+
theme_ipsum_es()+
theme(legend.position = "top")

How do we differentiate between pieces that we learn once and those that we come back to repeatedly? Examples include wanting to improve the playing further, loving a piece so much we want to relearn it, preparing it for a new performance, and so on.
As anyone who has ever played the piano knows, re-learning a piece, particularly after you “drop” it for a few months or years, results in a much better performance and understanding of the piece. I’ve definitely found that to be true in my experience, particularly with my exam pieces. The downside is that these pieces take longer to learn.
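The `Max_Break` variable used below can be sketched as the largest gap, in days, between consecutive sessions on the same piece. A hedged reconstruction on toy data (the actual ETL is not shown in the post):

```r
library(dplyr)

# Toy sessions for two hypothetical pieces; one row per practice session
sessions <- tibble::tibble(
  Project    = c("A", "A", "A", "B", "B"),
  Date_Start = as.Date(c("2020-01-01", "2020-01-03", "2020-03-01",
                         "2020-02-01", "2020-02-02"))
)

# Max_Break: the largest gap (in days) between consecutive sessions on a piece
max_breaks <- sessions %>%
  group_by(Project) %>%
  arrange(Date_Start, .by_group = TRUE) %>%
  summarise(Max_Break = max(c(0L, as.integer(diff(Date_Start)))), .groups = "drop")
```

Piece "A" here has a 58-day gap, so it would be flagged as having a break of over one month.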
model_data%>%
mutate(Project_formatted = str_replace_all(Project,"[^[:graph:]]", " "),
Project_label = as.factor(ifelse(Max_Break > 31, Project_formatted, "")))%>%
ggplot(aes(as.integer(Max_Break), Duration, col = Max_Break <= 31))+
geom_point(size = 3)+
geom_text_repel(aes(label = Project_label), size = 3, show.legend = FALSE)+
scale_x_log10()+
scale_color_tron(labels = c(TRUE, FALSE))+
guides(colour = guide_legend(reverse=TRUE))+
labs(x = "Maximum days passed between two consecutive sessions on the same piece (log scale)",
y = "Hours needed to learn a piece",
col = "Break (over 1 month)",
subtitle = "Taking a break before finishing a piece might lead to more hours required to learn it")+
theme_ipsum_es()+
theme(legend.position = "top")

model_data%>%
select(-Days_Practiced, -Date_End, -Max_Break)%>%
mutate(Duration = round(Duration),
Length = round(Length, 1))%>%
arrange(desc(Date_Start))%>%
rename(Experience = Cumulative_Duration,
Started = Date_Start)%>%
relocate(Project, Duration, Genre, ABRSM, Level, Standard, Length, Experience, Break, Started)%>%
kbl(escape = FALSE,caption = "Repertoire")%>%
kable_paper(c("hover", "striped"), full_width = F)%>%
column_spec(c(1,2), bold = T, color = "black")%>%
scroll_box(height = "450px")

| Project | Duration | Genre | ABRSM | Level | Standard | Length | Experience | Break | Started |
|---|---|---|---|---|---|---|---|---|---|
| Elton John - Rocket man | 47 | Modern | 7 | Advanced | Performance | 4.0 | 1130 | No | 2020-12-08 |
| Schumann - Träumerei | 14 | Romantic | 7 | Advanced | Average | 3.0 | 1087 | No | 2020-11-09 |
| Mozart - Allegro (3rd movement) K282 | 28 | Classical | 6 | Intermediate | Average | 3.3 | 1081 | Yes | 2020-11-05 |
| Ibert - Sérénade sur l’eau | 10 | Modern | 6 | Intermediate | Performance | 1.7 | 1038 | No | 2020-09-24 |
| Kuhlau - Rondo Vivace | 24 | Classical | 6 | Intermediate | Average | 2.2 | 1014 | No | 2020-08-03 |
| C. Hartmann - The little ballerina | 21 | Romantic | 6 | Intermediate | Performance | 2.0 | 998 | No | 2020-07-14 |
| Schumann - Lalling Melody | 5 | Romantic | 1 | Beginner | Average | 1.3 | 981 | No | 2020-06-28 |
| Schumann - Melody | 4 | Romantic | 1 | Beginner | Average | 1.0 | 972 | No | 2020-06-20 |
| Clementi - Sonatina no 3 - Mov 2 | 3 | Classical | 1 | Beginner | Performance | 1.0 | 952 | No | 2020-06-04 |
| Clementi - Sonatina no 3 - Mov 3 | 20 | Classical | 4 | Beginner | Performance | 2.0 | 952 | No | 2020-06-04 |
| Chopin - Waltz in Fm | 27 | Romantic | 6 | Intermediate | Performance | 2.0 | 895 | Yes | 2020-04-18 |
| Clementi - Sonatina no 3 - Mov 1 | 30 | Classical | 4 | Beginner | Performance | 2.7 | 877 | No | 2020-04-07 |
| Schumann - Kinderszenen 1 | 10 | Romantic | 5 | Intermediate | Average | 2.0 | 855 | No | 2020-03-25 |
| Bach - Prelude in G from Cello Suite No 1 | 25 | Baroque | 5 | Intermediate | Average | 2.5 | 788 | No | 2020-02-04 |
| Georg Böhm - Minuet in G | 7 | Baroque | 1 | Beginner | Average | 1.0 | 780 | Yes | 2020-01-27 |
| Bach - Invention 4 in Dm | 21 | Baroque | 5 | Intermediate | Performance | 1.7 | 777 | No | 2020-01-25 |
| Chopin - Contredanse in Gb | 23 | Romantic | 6 | Intermediate | Performance | 2.2 | 762 | No | 2020-01-16 |
| Bach - Minuet in Gm - 115 | 7 | Baroque | 1 | Beginner | Average | 1.3 | 750 | No | 2020-01-07 |
| Bach - Minuet in G - 114 | 4 | Baroque | NA | Advanced | NA | NA | 726 | No | 2019-12-06 |
| Elton John - Your song (Arr Cornick) | 36 | Modern | 5 | Intermediate | Performance | 3.3 | 713 | No | 2019-11-21 |
| Poulenc - Valse Tyrolienne | 17 | Modern | 5 | Intermediate | Performance | 1.7 | 562 | No | 2019-09-02 |
| Bach - Prelude in Cm - 934 | 25 | Baroque | 5 | Intermediate | Performance | 2.4 | 536 | No | 2019-08-15 |
| Schumann - Volksliedchen | 10 | Romantic | 2 | Beginner | Average | 1.8 | 501 | No | 2019-07-01 |
| Haydn - Andante in A | 39 | Classical | 5 | Intermediate | Average | 2.8 | 468 | Yes | 2019-06-08 |
| Schumann - Remembrance | 34 | Romantic | 5 | Intermediate | Performance | 2.2 | 422 | Yes | 2019-04-28 |
| Bach - Minuet in G - 116 | 8 | Baroque | NA | Advanced | NA | NA | 361 | Yes | 2019-03-04 |
| Bach - Invention 1 in C | 27 | Baroque | 5 | Intermediate | Performance | 1.7 | 350 | Yes | 2019-02-22 |
| Chopin - Waltz in Am | 26 | Romantic | 4 | Beginner | Performance | 2.5 | 305 | Yes | 2019-01-07 |
Question: How long would it take to learn a piece based on various factors?
Given the very limited data at the advanced level (ABRSM grade 7), those two pieces will be removed; one of them is also an extreme outlier that would significantly skew our models.
model_data <- model_data%>%filter(ABRSM != 7)%>%droplevels()

There are no missing values in the modelling dataset following the ETL process.
Let’s use some basic standardisation offered by the caret package, such as centering (subtracting the mean from each value) and scaling (dividing values by the standard deviation).
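Concretely, centering and scaling maps each numeric column x to (x − mean(x)) / sd(x), so every predictor ends up with mean 0 and standard deviation 1. A tiny illustration by hand:

```r
# centering and scaling a toy vector, the same transform caret applies per column
x <- c(10, 20, 30)
scaled <- (x - mean(x)) / sd(x)
scaled  # -1 0 1
```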
set.seed(123)
# set a backup variable
backup <- model_data
y <- model_data$Duration
# take out our response variable temporarily as we do not want this to be processed
model_data <- model_data %>%
select(-Duration)
# center and scale our data (BoxCox if needed)
preProcess_range_model <- preProcess(model_data, method=c("center", "scale"))
model_data <- predict(preProcess_range_model, newdata = model_data)
# append the Y variable back on with original values
model_data$Duration <- y

Given the small size of the dataset, the bootstrap resampling method will be applied.
train.control <- trainControl(method = "boot",
number = 25,
search = "random")

# set number of clusters
clusters <- 4
# run them all in parallel
cl <- makeCluster(clusters, type = "SOCK")
# register cluster to train in parallel
registerDoSNOW(cl)
# train models
model <- train(Duration ~ ABRSM + Genre + Length + Cumulative_Duration + Break + Standard,
data = model_data,
method = "ranger",
tuneLength = 100,
trControl = train.control)
model2 <- train(Duration ~ ABRSM + Genre + Length + Cumulative_Duration + Break + Standard,
data = model_data,
method = "lmStepAIC",
tuneLength = 100,
trControl = train.control)
model3 <- train(Duration ~ ABRSM + Genre + Length + Cumulative_Duration + Break + Standard,
data = model_data,
method = "lm",
tuneLength = 100,
trControl = train.control)
model4 <- train(Duration ~ ABRSM + Genre + Length + Cumulative_Duration + Break + Standard,
data = model_data,
method = "ridge",
tuneLength = 100,
trControl = train.control)
model5 <- train(Duration ~ ABRSM + Genre + Length + Cumulative_Duration + Break + Standard,
data = model_data,
method = "rf",
tuneLength = 100,
trControl = train.control)
model6 <- train(Duration ~ ABRSM + Genre + Length + Cumulative_Duration + Break + Standard,
data = model_data,
method = "gbm",
tuneLength = 100,
trControl = train.control)
model7 <- train(Duration ~ ABRSM + Genre + Length + Cumulative_Duration + Break + Standard,
data = model_data,
method = "pls",
tuneLength = 100,
trControl = train.control)
# shut the instances of R down
stopCluster(cl)
# compare models
model_list <- list(ranger = model, lmStepAIC = model2, lm = model3, ridge = model4, rf = model5, gbm = model6, pls = model7)
model_comparison <- resamples(model_list)
# learning curves to indicate overfitting and underfitting
# hyper parameters
# https://topepo.github.io/caret/model-training-and-tuning.html#model-training-and-parameter-tuning
# https://topepo.github.io/caret/random-hyperparameter-search.html

We chose the Random Forest model as it was the best performing model. It is also known to be robust to outliers and able to capture non-linear relationships.
summary(model_comparison)

##
## Call:
## summary.resamples(object = model_comparison)
##
## Models: ranger, lmStepAIC, lm, ridge, rf, gbm, pls
## Number of resamples: 25
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## ranger 3.021008 4.266143 4.946282 5.299442 6.243741 7.922518 0
## lmStepAIC 4.028395 5.690166 7.868767 8.503530 9.837814 17.181537 0
## lm 2.039521 5.772827 7.115061 7.641375 8.368587 16.692018 0
## ridge 3.406549 4.539113 4.997571 5.161525 6.107118 6.701543 13
## rf 1.996873 4.456777 5.186487 5.185054 5.888977 7.589383 0
## gbm 2.553344 5.186242 6.269624 6.590318 8.017100 11.058042 0
## pls 3.322617 4.289868 4.661391 4.939639 6.014574 6.505947 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## ranger 3.864228 5.315607 6.162724 6.350531 7.357032 9.001666 0
## lmStepAIC 4.956079 7.563875 9.427272 10.849266 12.508605 22.507683 0
## lm 2.072607 7.474750 8.841582 9.608014 10.506074 21.748986 0
## ridge 3.789583 5.322770 6.175154 6.371706 7.532710 8.867574 13
## rf 2.845897 5.896209 6.423869 6.527895 7.398713 9.887790 0
## gbm 2.863830 7.040941 8.600280 8.355581 9.551342 12.558296 0
## pls 3.398100 5.122330 5.592774 5.753895 6.480938 8.004247 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## ranger 0.395533303 0.5110174 0.6963843 0.6508775 0.7521799 0.9194498 0
## lmStepAIC 0.032085810 0.2440455 0.5301134 0.4420157 0.5929040 0.8193574 0
## lm 0.002773983 0.3152272 0.5205367 0.4709629 0.6646064 0.9594312 0
## ridge 0.156077140 0.6851875 0.7338446 0.7108150 0.8053173 0.9515817 13
## rf 0.125014858 0.5885457 0.6937981 0.6618163 0.7600447 0.9296281 0
## gbm 0.113071932 0.3462277 0.5310918 0.5266083 0.6574668 0.9769281 0
## pls 0.010994365 0.6228905 0.7980685 0.7319133 0.8426825 0.9723046 0
Based on our regression model, there does not appear to be significant multicollinearity between the full model’s variables, so we can continue as is.
tidy(vif(model3$finalModel))%>%
rename(VIF = x)%>%
mutate(VIF = round(VIF, 1))%>%
arrange(desc(VIF))%>%
kbl(caption = "Variance Inflation Factor (VIF)")%>%
kable_paper("hover", full_width = F)

| names | VIF |
|---|---|
| ABRSM5 | 5.3 |
| Cumulative_Duration | 4.3 |
| ABRSM6 | 4.2 |
| ABRSM4 | 3.8 |
| Length | 2.8 |
| GenreClassical | 2.6 |
| StandardPerformance | 2.2 |
| ABRSM2 | 2.0 |
| GenreRomantic | 2.0 |
| BreakNo | 2.0 |
| GenreModern | 1.7 |
selected_model <- model5
#Saving the model
saveRDS(selected_model, file = "model.rda")
#get predictions
predictions <- predict(selected_model, model_data)
#create dataset
model_data2 <- model_data
model_data2$Predicted <- predictions
model_data2$Actual <- model_data$Duration
model_data2$Residuals <- model_data2$Actual - model_data2$Predicted
# model_data2 <- model_data%>%
# mutate(Actual = as.numeric(Duration),
# Predicted = as.numeric(predictions),
# Residuals = Actual - Predicted)%>%
# select(Predicted, Actual, Residuals, Project, Level, Genre)
#visualise predicted vs actual
ggplotly(
ggplot(model_data2, aes(Predicted, Actual, label = Residuals))+
geom_point(aes(text = Project, fill = Level), size = 3, shape = 21, col = "black", alpha = 0.75)+
geom_smooth(method = "loess", col = "red", lwd = 1, se = FALSE, group = 1)+
geom_abline(lty = "dashed", lwd = 0.5, col = "gray")+
coord_cartesian(xlim = c(0,50), ylim = c(0,50))+
labs(col = NULL)+
scale_fill_tron()+
theme_ipsum_es() +
theme(legend.position = "top")
) %>%
layout(legend = list(orientation = "h", x = 0.4, y = 1.2))

We can see that the residuals are mostly centered around 0.
ggplot(model_data2, aes(Residuals, fill = ..count..))+
geom_histogram(binwidth = 1, col = "black")+
geom_vline(aes(xintercept=mean(Residuals)), lwd = 1, lty = 2) +
labs(x="Residuals",
y= "Total occurrences")+
scale_fill_gradient(low="yellow", high="red")+
theme_ipsum_es()+
theme(legend.position = "none")

Looking at the variability of the errors, there is still a tendency to over-predict for pieces that took very little time and to under-predict for the more difficult ones — a regression-to-the-mean pattern that is common for tree-based models, whose predictions are averages over training observations.
ggplotly(
ggplot(model_data2, aes(Actual, Residuals, col = Level, label = Predicted))+
geom_hline(yintercept = 0, size = 3, color = "grey52")+
geom_point(aes(text = Project), alpha = 0.75, size = 3)+
geom_smooth(method = "loess", col = "red", se = FALSE)+
labs(col = NULL)+
scale_color_tron()+
theme_ipsum_es()
) %>%
layout(legend = list(orientation = "h", x = 0.4, y = 1.2))

We can see that the Random Forest performed significantly better than the linear regression model. This isn’t surprising: there are likely non-linear trends in the data, which random forests can capture.
tidy(compare_models(model3, model5))%>%
kbl(caption = "Model 1 vs model 2")%>%
kable_paper("hover", full_width = F)

| estimate | statistic | p.value | parameter | conf.low | conf.high | method | alternative |
|---|---|---|---|---|---|---|---|
| 3.080119 | 3.333892 | 0.0027725 | 24 | 1.173323 | 4.986915 | One Sample t-test | two.sided |
plot(model5, main = "The most optimal model was that with 6 predictors", col = "orange", lwd = 1.5)

We can now see that the most important variables are the length of the piece, my experience prior to starting a piece, and the difficulty of the piece. These were also confirmed by the linear regression model.
imp <- as.matrix(varImp(model5)$importance)%>%
as.data.frame()%>%
rename(Importance = Overall)%>%
mutate(Feature = as.factor(rownames(.)),
Feature = reorder(Feature, Importance))
ggplot(imp, aes(Feature, Importance))+
geom_segment(aes(Feature, y = 0, xend = Feature, yend = Importance), col = "black", size = 1.5) +
geom_point(size = 10, col = "orange")+
geom_text(aes(label = paste(round(Importance), "%", sep = "")), color = "black", size = 3, check_overlap = TRUE)+
scale_color_tron()+
scale_fill_tron()+
theme_ipsum_es()+
coord_flip()+
labs(title = "Variable importance ranking")+
theme(axis.text.x = element_blank(),
axis.ticks = element_blank())

#plot(varImp(model5))